{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Removing and splitting pandas DataFrame columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you are preparing to train machine learning models, you often need to delete specific columns, or split certain columns from your DataFrame into a new DataFrame.\n",
    "\n",
    "We need the pandas library and a DataFrame to explore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load a bigger csv file with more columns, **flight_delays.csv** provides information about flights and flight delays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>FL_DATE</th>\n",
       "      <th>OP_UNIQUE_CARRIER</th>\n",
       "      <th>TAIL_NUM</th>\n",
       "      <th>OP_CARRIER_FL_NUM</th>\n",
       "      <th>ORIGIN</th>\n",
       "      <th>DEST</th>\n",
       "      <th>CRS_DEP_TIME</th>\n",
       "      <th>DEP_TIME</th>\n",
       "      <th>DEP_DELAY</th>\n",
       "      <th>CRS_ARR_TIME</th>\n",
       "      <th>ARR_TIME</th>\n",
       "      <th>ARR_DELAY</th>\n",
       "      <th>CRS_ELAPSED_TIME</th>\n",
       "      <th>ACTUAL_ELAPSED_TIME</th>\n",
       "      <th>AIR_TIME</th>\n",
       "      <th>DISTANCE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N221WN</td>\n",
       "      <td>802</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>905</td>\n",
       "      <td>903</td>\n",
       "      <td>-2</td>\n",
       "      <td>1450</td>\n",
       "      <td>1433</td>\n",
       "      <td>-17</td>\n",
       "      <td>225</td>\n",
       "      <td>210</td>\n",
       "      <td>197</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N8329B</td>\n",
       "      <td>3744</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>1500</td>\n",
       "      <td>1458</td>\n",
       "      <td>-2</td>\n",
       "      <td>2045</td>\n",
       "      <td>2020</td>\n",
       "      <td>-25</td>\n",
       "      <td>225</td>\n",
       "      <td>202</td>\n",
       "      <td>191</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N920WN</td>\n",
       "      <td>1019</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1800</td>\n",
       "      <td>1802</td>\n",
       "      <td>2</td>\n",
       "      <td>2045</td>\n",
       "      <td>2032</td>\n",
       "      <td>-13</td>\n",
       "      <td>105</td>\n",
       "      <td>90</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N480WN</td>\n",
       "      <td>1499</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>950</td>\n",
       "      <td>947</td>\n",
       "      <td>-3</td>\n",
       "      <td>1235</td>\n",
       "      <td>1223</td>\n",
       "      <td>-12</td>\n",
       "      <td>105</td>\n",
       "      <td>96</td>\n",
       "      <td>81</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N227WN</td>\n",
       "      <td>3635</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1150</td>\n",
       "      <td>1151</td>\n",
       "      <td>1</td>\n",
       "      <td>1430</td>\n",
       "      <td>1423</td>\n",
       "      <td>-7</td>\n",
       "      <td>100</td>\n",
       "      <td>92</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      FL_DATE OP_UNIQUE_CARRIER TAIL_NUM  OP_CARRIER_FL_NUM ORIGIN DEST  \\\n",
       "0  2018-10-01                WN   N221WN                802    ABQ  BWI   \n",
       "1  2018-10-01                WN   N8329B               3744    ABQ  BWI   \n",
       "2  2018-10-01                WN   N920WN               1019    ABQ  DAL   \n",
       "3  2018-10-01                WN   N480WN               1499    ABQ  DAL   \n",
       "4  2018-10-01                WN   N227WN               3635    ABQ  DAL   \n",
       "\n",
       "   CRS_DEP_TIME  DEP_TIME  DEP_DELAY  CRS_ARR_TIME  ARR_TIME  ARR_DELAY  \\\n",
       "0           905       903         -2          1450      1433        -17   \n",
       "1          1500      1458         -2          2045      2020        -25   \n",
       "2          1800      1802          2          2045      2032        -13   \n",
       "3           950       947         -3          1235      1223        -12   \n",
       "4          1150      1151          1          1430      1423         -7   \n",
       "\n",
       "   CRS_ELAPSED_TIME  ACTUAL_ELAPSED_TIME  AIR_TIME  DISTANCE  \n",
       "0               225                  210       197      1670  \n",
       "1               225                  202       191      1670  \n",
       "2               105                   90        80       580  \n",
       "3               105                   96        81       580  \n",
       "4               100                   92        80       580  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "delays_df = pd.read_csv('Data/flight_delays.csv')\n",
    "delays_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing a column from a DataFrame.\n",
    "\n",
    "When you are preparing your data for machine learning, you may need to delete specific columns from the DataFrame before training the model.\n",
    "\n",
    "For example:\n",
    "Imagine you are training a model to predict how many minutes late a flight will be (ARR_DELAY)\n",
    "\n",
    "If the model knew the scheduled arrival time (CRS_ARR_TIME) and the actual arrival time (ARR_TIME), the model would quickly figure out ARR_DELAY = ARR_TIME - CRS_ARR_TIME\n",
    "\n",
    "When we predict arrival times for future flights, we won't have a value for  arrival time (ARR_TIME). So we should remove this column from the DataFrame so it is not used as a feature when training the model to predict ARR_DELAY.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>FL_DATE</th>\n",
       "      <th>OP_UNIQUE_CARRIER</th>\n",
       "      <th>TAIL_NUM</th>\n",
       "      <th>OP_CARRIER_FL_NUM</th>\n",
       "      <th>ORIGIN</th>\n",
       "      <th>DEST</th>\n",
       "      <th>CRS_DEP_TIME</th>\n",
       "      <th>DEP_TIME</th>\n",
       "      <th>DEP_DELAY</th>\n",
       "      <th>CRS_ARR_TIME</th>\n",
       "      <th>ARR_DELAY</th>\n",
       "      <th>CRS_ELAPSED_TIME</th>\n",
       "      <th>ACTUAL_ELAPSED_TIME</th>\n",
       "      <th>AIR_TIME</th>\n",
       "      <th>DISTANCE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N221WN</td>\n",
       "      <td>802</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>905</td>\n",
       "      <td>903</td>\n",
       "      <td>-2</td>\n",
       "      <td>1450</td>\n",
       "      <td>-17</td>\n",
       "      <td>225</td>\n",
       "      <td>210</td>\n",
       "      <td>197</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N8329B</td>\n",
       "      <td>3744</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>1500</td>\n",
       "      <td>1458</td>\n",
       "      <td>-2</td>\n",
       "      <td>2045</td>\n",
       "      <td>-25</td>\n",
       "      <td>225</td>\n",
       "      <td>202</td>\n",
       "      <td>191</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N920WN</td>\n",
       "      <td>1019</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1800</td>\n",
       "      <td>1802</td>\n",
       "      <td>2</td>\n",
       "      <td>2045</td>\n",
       "      <td>-13</td>\n",
       "      <td>105</td>\n",
       "      <td>90</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N480WN</td>\n",
       "      <td>1499</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>950</td>\n",
       "      <td>947</td>\n",
       "      <td>-3</td>\n",
       "      <td>1235</td>\n",
       "      <td>-12</td>\n",
       "      <td>105</td>\n",
       "      <td>96</td>\n",
       "      <td>81</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N227WN</td>\n",
       "      <td>3635</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1150</td>\n",
       "      <td>1151</td>\n",
       "      <td>1</td>\n",
       "      <td>1430</td>\n",
       "      <td>-7</td>\n",
       "      <td>100</td>\n",
       "      <td>92</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      FL_DATE OP_UNIQUE_CARRIER TAIL_NUM  OP_CARRIER_FL_NUM ORIGIN DEST  \\\n",
       "0  2018-10-01                WN   N221WN                802    ABQ  BWI   \n",
       "1  2018-10-01                WN   N8329B               3744    ABQ  BWI   \n",
       "2  2018-10-01                WN   N920WN               1019    ABQ  DAL   \n",
       "3  2018-10-01                WN   N480WN               1499    ABQ  DAL   \n",
       "4  2018-10-01                WN   N227WN               3635    ABQ  DAL   \n",
       "\n",
       "   CRS_DEP_TIME  DEP_TIME  DEP_DELAY  CRS_ARR_TIME  ARR_DELAY  \\\n",
       "0           905       903         -2          1450        -17   \n",
       "1          1500      1458         -2          2045        -25   \n",
       "2          1800      1802          2          2045        -13   \n",
       "3           950       947         -3          1235        -12   \n",
       "4          1150      1151          1          1430         -7   \n",
       "\n",
       "   CRS_ELAPSED_TIME  ACTUAL_ELAPSED_TIME  AIR_TIME  DISTANCE  \n",
       "0               225                  210       197      1670  \n",
       "1               225                  202       191      1670  \n",
       "2               105                   90        80       580  \n",
       "3               105                   96        81       580  \n",
       "4               100                   92        80       580  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remove the column ARR_TIME from the DataFrane delays_df\n",
    "\n",
    "#delays_df = delays_df.drop(['ARR_TIME'],axis=1)\n",
    "new_df = delays_df.drop(columns=['ARR_TIME'])\n",
    "new_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the **inplace** parameter to specify you want to drop the column from the original DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>FL_DATE</th>\n",
       "      <th>OP_UNIQUE_CARRIER</th>\n",
       "      <th>TAIL_NUM</th>\n",
       "      <th>OP_CARRIER_FL_NUM</th>\n",
       "      <th>ORIGIN</th>\n",
       "      <th>DEST</th>\n",
       "      <th>CRS_DEP_TIME</th>\n",
       "      <th>DEP_TIME</th>\n",
       "      <th>DEP_DELAY</th>\n",
       "      <th>CRS_ARR_TIME</th>\n",
       "      <th>ARR_DELAY</th>\n",
       "      <th>CRS_ELAPSED_TIME</th>\n",
       "      <th>ACTUAL_ELAPSED_TIME</th>\n",
       "      <th>AIR_TIME</th>\n",
       "      <th>DISTANCE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N221WN</td>\n",
       "      <td>802</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>905</td>\n",
       "      <td>903</td>\n",
       "      <td>-2</td>\n",
       "      <td>1450</td>\n",
       "      <td>-17</td>\n",
       "      <td>225</td>\n",
       "      <td>210</td>\n",
       "      <td>197</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N8329B</td>\n",
       "      <td>3744</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>BWI</td>\n",
       "      <td>1500</td>\n",
       "      <td>1458</td>\n",
       "      <td>-2</td>\n",
       "      <td>2045</td>\n",
       "      <td>-25</td>\n",
       "      <td>225</td>\n",
       "      <td>202</td>\n",
       "      <td>191</td>\n",
       "      <td>1670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N920WN</td>\n",
       "      <td>1019</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1800</td>\n",
       "      <td>1802</td>\n",
       "      <td>2</td>\n",
       "      <td>2045</td>\n",
       "      <td>-13</td>\n",
       "      <td>105</td>\n",
       "      <td>90</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N480WN</td>\n",
       "      <td>1499</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>950</td>\n",
       "      <td>947</td>\n",
       "      <td>-3</td>\n",
       "      <td>1235</td>\n",
       "      <td>-12</td>\n",
       "      <td>105</td>\n",
       "      <td>96</td>\n",
       "      <td>81</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2018-10-01</td>\n",
       "      <td>WN</td>\n",
       "      <td>N227WN</td>\n",
       "      <td>3635</td>\n",
       "      <td>ABQ</td>\n",
       "      <td>DAL</td>\n",
       "      <td>1150</td>\n",
       "      <td>1151</td>\n",
       "      <td>1</td>\n",
       "      <td>1430</td>\n",
       "      <td>-7</td>\n",
       "      <td>100</td>\n",
       "      <td>92</td>\n",
       "      <td>80</td>\n",
       "      <td>580</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      FL_DATE OP_UNIQUE_CARRIER TAIL_NUM  OP_CARRIER_FL_NUM ORIGIN DEST  \\\n",
       "0  2018-10-01                WN   N221WN                802    ABQ  BWI   \n",
       "1  2018-10-01                WN   N8329B               3744    ABQ  BWI   \n",
       "2  2018-10-01                WN   N920WN               1019    ABQ  DAL   \n",
       "3  2018-10-01                WN   N480WN               1499    ABQ  DAL   \n",
       "4  2018-10-01                WN   N227WN               3635    ABQ  DAL   \n",
       "\n",
       "   CRS_DEP_TIME  DEP_TIME  DEP_DELAY  CRS_ARR_TIME  ARR_DELAY  \\\n",
       "0           905       903         -2          1450        -17   \n",
       "1          1500      1458         -2          2045        -25   \n",
       "2          1800      1802          2          2045        -13   \n",
       "3           950       947         -3          1235        -12   \n",
       "4          1150      1151          1          1430         -7   \n",
       "\n",
       "   CRS_ELAPSED_TIME  ACTUAL_ELAPSED_TIME  AIR_TIME  DISTANCE  \n",
       "0               225                  210       197      1670  \n",
       "1               225                  202       191      1670  \n",
       "2               105                   90        80       580  \n",
       "3               105                   96        81       580  \n",
       "4               100                   92        80       580  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remove the column ARR_TIME from the DataFrame delays_df\n",
    "\n",
    "#delays_df = delays_df.drop(['ARR_TIME'],axis=1)\n",
    "delays_df.drop(columns=['ARR_TIME'], inplace=True)\n",
    "delays_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use different techniques to predict based on quantititative values which are usually numeric values (e.g. distance, number of minutes, weight) and qualitative (descriptive) values which may not be numeric (e.g. what airport a flight left from, what airline operated the flight)\n",
    "\n",
    "Quantitative data may be moved into a separate DataFrame before training a model.\n",
    "\n",
    "You also need to put the value you want to predict, called the label (ARR_DELAY) in a separate DataFrame from the values you think can help you make the prediction, called the features\n",
    "\n",
    "We need to be able to create a new dataframe from the columns in an existing dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Create a new DataFrame called desc_df\n",
    "# include all rows\n",
    "# include the columns ORIGIN, DEST, OP_CARRIER_FL_NUM, OP_UNIQUE_CARRIER, TAIL_NUM\n",
    "\n",
    "desc_df = delays_df.loc[:,['ORIGIN', 'DEST', 'OP_CARRIER_FL_NUM', 'OP_UNIQUE_CARRIER', 'TAIL_NUM']]\n",
    "desc_df.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
